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Abstract 



This paper describes an algorithm for fitting finite mixtures of unrestricted Multi- 
variate Skew t (FM-uMST) distributions. The package EMMIX-uskew implements a 
closed-form expectation-maximization (EM) algorithm for computing the maximum like- 
lihood (ML) estimates of the parameters for the (unrestricted) FM-MST model in R. 
EMMIX-uskew also supports visualization of fitted contours in two and three dimensions, 
and random sample generation from a specified FM-uMST distribution. 

Finite mixtures of skew t-distributions have proven to be useful in modelling hetero- 
geneous data with asymmetric and heavy tail behaviour, for example, datasets from flow 
cytometry. In recent years, various versions of mixtures with multivariate skew t (MST) 
distributions have been proposed. However, these models adopted some restricted char- 
acterizations of the component MST distributions so that the E-step of the EM algorithm 
can be evaluated in closed form. This paper focuses on mixtures with unrestricted MST 
components, and describes an iterative algorithm for the computation of the ML esti- 
mates of its model parameters. Its implementation in R is presented with the package 
EMMIX-uskew. 

The usefulness of the proposed algorithm is demonstrated in three applications to 
real data sets. The first example illustrates the use of the main function fmmst in the 
package by fitting a MST distribution to a bivariate unimodal flow cytometric sample. 
The second example fits a mixture of MST distributions to the Australian Institute of 
Sport (AIS) data, and demonstrate that EMMIX-uskew can provide better clustering 
results than mixtures with restricted MST components. In the third example, EMMIX- 
uskew is applied to classify cells in a trivariate flow cytometric datasct. Comparisons 
with other available methods suggests that the EMMIX-uskew result achieved a lower 
misclassification rate with respect to the labels given by benchmark gating analysis. 

Keywords: mixture models, skew distributions, multivariate i-distribution, EM algorithm, 
flow cytometry, R. 
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1. Introduction 

In many practical problems, data are often skewed, heterogeneous, and/or contains outliers. 
Finite mixture distributions of skewed distributions have become increasingly popular in mod- 
elling and analyzing such data. This use of finite mixture distributions to model heterogeneous 
data has undergone intensive development in the past decades, as witnessed by the numerous 
applications in various scientific fields such as bioinformatics, cluster analysis, genetics, infor- 
mation processing, medicine, and pattern recognition. For a comprehensive survey on mixture 
models and their applications see, for example, the monographs by Everitt and Hand (1981), 
Titterington et al. (1985), McLachlan and Basford (1988), Lindsay (1995), Bohning (2000), 
McLachlan and Peel (2000), Frtihwirth-Schnatter (2006), and Mengersen et al. (2011), and 
also the papers by Banfield and Raftery (1993) and Fraley and Raftery (1999). 

In recent years, finite mixtures of skew t-distributions have been exploited as an effective 
tool in modelling high-dimensional multimodal and asymmetric dataset; see, for example, 
Pyne et al. (2009a) and Frtihwirth-Schnatter and Pyne (2010). Following the introduction of 
the skew-normal (SN) distribution by Azzalini (1985), several authors have studied skewed 
extensions of the t-distribution. Finite mixture models with multivariate skew t (MST) com- 
ponents was first proposed by Pyne et al. (2009a) in a study of an automated approach to the 
analysis of flow cytometry data. Wang (2009) has given a package EMMIX-skew for the im- 
plementation in R (R Development Team 2011) of their algorithm. More recently, Basso et al. 
(2010) studied a class of mixture models where the components densities are scale mixtures of 
univariate skew-normal distributions, known as the skew- normal/independent (SNI) family of 
distributions, which include the (univariate) skew-normal and skew t-distributions as special 
cases. This work was later extended to the multivariate case in Cabral et al. (2012), and was 
implemented in an R package mixsmsn. However, in these characterizations, restrictions were 
imposed on the component skew t-distributions in order to obtain manageable analytical ex- 
pressions for the conditional expectations involved in the E-step of the EM algorithm. These 
versions of the skew t-distributions are known as the 'restricted' form of the MST distribution; 
see Lee and McLachlan (2012a) for further discussions on this. 

In this paper, we present a fitting algorithm for the unrestricted skew t-mixture model. We 
show that an EM algorithm can be implemented exactly without restricting the character- 
izations of the component MST distributions. Closed form expressions can be obtained for 
the E-step conditional expectations by recognizing that they can be formulated as moments 
of a multivariate non-central truncated t-variate, which can be further expressed in terms of 
central t-distributions. The algorithm is implemented in R in the package EMMIX-uskew. 

The package EMMIX-uskew consists of three main functions: fmmst, rfmmst, and con- 
tour. fmmst. The main routine fmmst fits a mixture of unrestricted MST (uMST) distributions 
using an EM algorithm described in Section 3. The function rfmmst generates random samples 
from mixtures of uMST distributions. For a user friendly visualisation of the fitted models, 
fmmst. contour provides 2D contour maps of the fitted bivariate densities and 3D displays with 
interactive viewpoint navigation facility for trivariate densities. 

The remainder of this paper is organized as follows. Section 2 provides a brief description of the 
uMST distribution and defines the FM-uMST model. Section 3 presents an EM algorithm for 
fitting the FM-uMST model. In the next section, an explanation of how to fit, visualize, and 
interpret the FM-uMST models using EMMIX-uskew is presented. For illustration purposes, 
EMMIX-uskew is applied to three applications and comparisons are made with some restricted 
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FM-MST models and other clustering methods. Finally, we conclude with a brief summary 
of our results. 



2. Finite mixtures of multivariate skew ^-distributions 

We begin by defining the (unrestricted) multivariate skew t-density. Let Y be a p-dimensional 
random vector. Then Y is said to follow a p-dimensional unrestricted skew i-distribution 
(Sahu et al. 2003) with p x 1 location vector fx, p x p scale matrix S, p X 1 skewness vector 
S, and (scalar) degrees of freedom u, if its probability density function (pdf) is given by 



f p (y; ii, E, S, v) = 2Ptp, v (y; fi, ft) T PjV+p (y*; 0, A) , 



(1) 



where 



A = diag(tf), 
ft = £ + A 2 , 



y 



v + p 



v + d{ v y 

q = Afi-^y - /i), 
d(y) = (y - tifft'^y - M ), 
A = I p - Aft' 1 A. 

Here the operator diag(<5) denotes a diagonal matrix with diagonal elements specified by the 
vector 6. Also, we let t PtU {.; /x, S) be the p-dimensional t-distribution with location vector n, 
scale matrix X, and degrees of freedom z^, and T PtU (.; /x, S) the corresponding (cumulative) 
distribution function. The notation Y" ~ uMST Pj ^(/x, S, <5) will be used. Note that when 
6 = 0, (1) reduces to the symmetric t-density t p>1/ (y; fx, X). Also, when z/ — > oo, we obtain 
the (unrestricted) skew normal distribution. 

Various versions of the multivariate skew t-density have been proposed in recent years. It is 
worth noting that the versions considered by Azzalini and Dalla Valle (1996), Gupta (2003), 
and Lachos et al. (2010), among others, are different from (1). These versions are simpler in 
that the skew t-density is defined in terms involving only the univariate t-distribution function 
instead of the multivariate form of the latter as used in (1). These simplified characterizations 
have the advantage of having closed form expressions for the conditional expectations that 
have to be calculated on the E-step. The reader is referred to Lee and McLachlan (2012a, b) 
for a discussion on different forms of skew t-distributions. We shall adopt the unrestricted 
form (1) of the MST distribution here as proposed by Sahu et al. (2003), and describe a 
computationally efficient EM algorithm for fitting such model. 

A ^-component finite mixture of uMST distributions has density given by 

9 

f {y\ *) = 2 ^hfp (y; s /n S h , v h ) , (2) 

h=l 

where f p (y; n h , S^, S^, v^) denotes the /ith uMST component of the mixture model as de- 
fined by (1), with location parameter fi h , scale matrix S/j, skew parameter Sh, and degrees 
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of freedom v^. The mixing proportions satisfy > (h = 1, . . . ,g) and Ylh=i n h = 1- 
We shall denote the model defined by (2) by the FM-uMST (finite mixture of uMST) dis- 
tributions. Let \l/ contain all the unknown parameters of the FM-uMST model; that is, 
^ = (iri, . . . , Trg-i,6i , . . . , where now Oh consists of the unknown parameters of the 

hth. component density function. The density values for a uMST and FM-uMST distribution 
can be evaluated using the functions dmst and df mmst in EMMIX-uskew. 

Random samples of uMST variates can be generated by adopting a stochastic representation 
of (1) (Lin 2010). If Y ~ uMST PyU (fi, £, S), then 

Y = A\U 1 \ + -!=U , (3) 

where the random variables 

C7 ~iV p (0,E), (4) 

£/!~iV p (0,I p ), (5) 

w ~ gamma ~J , (6) 

are independent, and gamma(a, /?) denotes the gamma distribution with shape and scale 
parameters given by a and f3 respectively. Sampling of uMST and FM-uMST variates are 
implemented in EMMIX-uskew in the rmst and rf mmst functions, respectively. 



3. The EMMIX-uskew algorithm 

From (3) to (6), the uMST distribution admits a convenient hierarchical characterization 
that facilitates the computation of the maximum likelihood estimator (MLE) of the unknown 
model parameters using the EM algorithm, namely, 

Yj | Uj , Wj 

Uj | Wj 
Wj 

where = diag^^), HNplp,^) denotes the p-dimensional half-normal distribution with 
location parameter \i and scale matrix H, and 7r = (tti, . . . , vr 9 ) T . 



N p I n + Auj, — S 



~ HN n ( 0, —I 



p / ' 

Wj 
'V v\ 

' o' 2/ ' 



3.1. Fitting of FM-uMST model via the EM algorithm 

To formulate the estimation of the unknown parameters as an incomplete-data problem in the 
EM framework, we introduce a set of latent component labels Zj = (z\j, . . . ,z g j) (j = 1, . . . , n) 
in addition to the unobservable variables Uj and Wj, where each element z^j is a zero-one 
indicator variable with Zhj = 1 if yj belongs to the /ith component, and zero otherwise. 
Thus, Ylh=i Zh i = 1 C? = 1' • • • ' n )- ft follows that the random vector Zj corresponding to 
Zj follows a multinomial distribution with one trial and cell probabilities 7ri, . . . ,7r g ; that is, 
Zj ~Mult g (l;7ri,...,7r s ). 
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The complete-data log likelihood function can be factored into the marginal densities of the 
Zj , the conditional densities of the Wj given Zj , and the conditional densities of the y 3 given 
Uj, Wj, and Zj. Accordingly, the complete-data log likelihood is given by 



where 



log L c (*) = log Lie (*) + log L 2c (*) + log L 3c (*) 



(7) 



Llc(*) 
^2c (*) 



53 «hj log (TTfc) 
9 n 

53 23 Zh i 

h=i j=i 

logr 

3 



log 



+ 17 + P - 1 log (tUj 



^3c (*) 



53 13 

h=i i=i 



1 



-plog(2vr) - -log|S/j 



h y ? ) + («,- 



}- 



(8) 



and where 



dh 



2 



[vi) = {vj - t*h) n h 1 (wj - /**) 

Qhj = {Vj 
Ah = I p 

Here \I/ contains all the unknown parameters of the FM-uMST model. 

The implementation of the EM algorithm requires alternating repeatedly the E- and M-steps 
until convergence in the case where the changes in the log likelihood values are less than 
some specified small value. The E-step calculates the expectation of the complete-data log 
likelihood given the observed data y using the current estimate of the parameters, known as 
the Q-function, given by 

= ^ w {logL c (*)|y}. 

The M-step then maximizes the Q-function with respect to the parameters 

On the (k + l)th iteration, the E-step requires the calculation of the conditional expectations 



„(*) 
,(*) 

-2,j 
,(*) 



E eik) {Wj 



Vj)> 



4] = E 9W (WjUjUj | y 3 ) . 



(9) 
(10) 

(11) 



The posterior probability of membership of the hth component by y 3 , using the current 
estimate \I>( fc ) for is given using Bayes' Theorem by 



_(*) 



(fc) ,M 



( fc ) t ( ( k ) ^( k ) x( k ) ( fc ) 



(12) 
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It can be shown that the conditional expectations e[ fc j , e^- , and e^j are given by 



„(*) 

'Ui 



'2,hj 



v h +P 



T 



\ ( fc ) , A k ) 



(k) / uj k) +p+2 



(*) 



n a (fc) 

y hj ;°, A h 



(13) 
(14) 



and 



(15) 



where is a p-dimensional i-variate truncated to the positive hyperplane K + , which is 
distributed as 



X 



~ tt p,u™ + p+2 



(16) 



The truncated moments E(X) and E(XX T ) can be swiftly evaluated by noting that they can 
be expressed in terms of the distribution function of a (non-truncated) multivariate central 
i-random vector; Lee and McLachlan (2011, 2012a). Recently, Ho et al. (2012) have con- 
sidered the moments of of the doublt truncated multivariate i-distribution, but their result 
corresponding to (16) would appear to be incorrect (see Lee and McLachlan (2012a)). 

The (k + l)th M-step consists of maximization of the Q-function with respect to It follows 
that an updated estimate of the unknown parameters of the FM-uMST model is given by 



(fe) 



s(*+i) 



En 
.7=1 1 



(k) 
hj 



,(*) 



%hjVj 



A ( k )J k ) 

h ^2,hj 



.(*)-(*) 



En 
.7=1 



(fe) (fe) 



3=1 



diag | E« _1 



3=1 



.(*) 



(k+l)\ Jfc) 



(17) 

(18) 



and 



,(fe+i) 



En 
7=1 



.(*) 

hj 3=1 



.(*) 
hi 



A (fc+1) (fc) T A (fc+l) T 



r ) 4i(^-^ +i) ) T +(^-^ +1) ) (W 1 



(fc+i)^ e (fc) T A (fc+i) 



,(*) 

'1,^3 



(19) 



where denotes element-wise matrix product. Note that (18) and also (16) are given incor- 
rectly in Lee and McLachlan (2011). 

An update of the degrees of freedom is obtained by solving iteratively the equation 
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where is the Digamma function. 

This algorithm is implemented as the fmmst function in EMMIX-uskew. 



3.2. Choosing initial values 

It is important to obtain suitable initial values in order for fmmst to converge quickly. In 
EMMIX-uskew, starting values for the model parameters are based on an initial clustering 
given by A;- means. Twenty attempts of /c-means are performed, and the starting component 
labels Zj (j = 1, ... ,n) are initialized according to the clustering result with the highest 
relative log likelihood (see Lee and McLachlan (2012a)). The other parameters are initialized 
as follows: 

E<°> = S h -(a-l)dizg(s h ), 
(1 - a)vr 

hi 



* (0) = S i g n(W^T* 



7T 

2 



40, 



where Sh is the sample covariance of the /ith component, and where -y h is the sample skewness 
of the /ith component, whose ith. element is given by 



(i = 1, p), 



and where yij denotes the ith element of the jth observation, and \x% is the ith element of /x. 
Here, denotes the vector created by extracting the main diagonal of Sh, and the vector s k is 
created by taking the square root of each element in Sh- The scalar a is varied systematically 
across the interval (0, 1) to search for a (relatively) optimal set of starting values for the model 
parameters. 

3.3. Stopping rule 

EMMIX-uskew adopts the traditional stopping criteria which is based on the relative change 
in the size of the log likelihood. The algorithm is terminated when the relative difference in 
log likelihood values between two successive iteration is less than a specified tolerance e, that 
is, the EM algorithm is stopped if 



|L( fe +!)| 



where L^ k+1 ^ denotes the log likelihood value at the (fc + l)th iteration. The default tolerance 
is e = 10 -6 , but the user can specify a different value. 
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4. Using the EMMIX-uskew package 

The parameters of the FM-uMST model in EMMIX-uskew are specified as a list structure 
containing the elements described in Table 1. The parameters /x, S, and S are each imple- 
mented as a list of g matrices, where g is the number of components in the fitted model. For 
example, mu[[2]] is a p x 1 matrix representing /i 2 . Each sigma[[h]] (h = 1, . . . , g) is a 
p x p matrix representing the symmetric positive definite scale matrix of the hth component. 
The parameters nu and pro are g by 1 arrays, representing the vector of degrees of freedom 
and the vector of mixing proportions for each component, respectively. 



parameter R syntax Dimensions Description 





mu 


V 


X 


1 X 


9 


the location parameter 


£ 


sigma 


P 


X 


p X 


9 


the scale matrix 


5 


delta 


P 


X 


1 X 


9 


the skewness parameter 


V 


dof 




9 


X 1 




the degrees of freedom 


TT 


Pi 




9 


X 1 




the mixing proportions 



Table 1: Structure of the model parameters in EMMIX-uskew. 



The probability density function of a multivariate skew t-distribution is calculated by the 
dmst function. The parameter dat is an n x p matrix, containing the coordinates of the n 
point (s) at which the density is to be evaluated. The following command will return a vector 
of n density values. 

dmst (dat, mu, sigma, delta, dof) 

For a FM-uMST density, the function df mmst can be used. 

dfmmst (dat , mu, sigma, delta, dof, pro) 

4.1. Generating samples from a FM-uMST distribution 

Consider generating a random sample of n p-dimensional uMST observations, with location 
parameter /x, scale matrix S, skewness parameter S, and degrees of freedom v. The function 
rfmmst supports two types of inputs - the parameters can be passed as separate arguments, 
or as a single list argument known with elements as specified in Table 1: 

rfmmst(g, n, mu, sigma, delta, dof, pro, known=NULL, ...) 

As an example, suppose that \i = (1, 2) T , S is the identity matrix, S = (— 1, and v = 4. 
Then the following command will generate a random sample of 500 observations from the 
uMST2^(n, S, S) distribution, 

fi> rfmmst (1,500, c(l,2), diag(2) , c(-l,l), 4, 1) 

To generate a mixture of uMST random samples, the above command can be issued. Alter- 
natively, the parameters can be specified in a list structure (Table 1) obj as follows: 
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R> obj <- list() 

R> obj$mu <- list (c (17 ,19) , c(5,22), c(6,10)) 

R> obj$sigma <- list(diag(2) , matrix(c(2, 0,0,1) ,2) , matrix(c(3,7,7,24) ,2)) 

R> obj$delta <- list(c(3,1.5), c(5,10), c(2,0)) 

R> obj$dof <- c(l, 2, 3) 

R> obj$pro <- c(0.25, 0.25, 0.5) 

R> rfmmst(3, 500, known=obj) 

An output of the rf mmst function consists of p + 1 columns. The first p columns are the 
coordinates of the generated sample. The last column indicates from which component each 
data point is generated. Executing the above command will generate an output similar to the 
following: 







[,1] 




[,2] 


[,3] 


[1,] 


19 


.91520 


20, 


,48515 


1 


[2,] 


72 


.81161 


33, 


,41381 


1 


[3,] 


17 


.02193 


23, 


,29119 


1 


[4,] 


23 


.53926 


19, 


,27946 


1 


[5,] 


16 


.85195 


21, 


,21340 


1 


[6,] 


18 


.01906 


18, 


,16612 


1 


[7,] 


22 


.23609 


21, 


,12174 


1 


[8,] 


44 


. 65444 


28, 


,23259 


1 


[9,] 


18 


.18883 


26, 


,72330 


1 


[10,] 


20 


.18908 


18, 


,97005 


1 






rest omitted . . . 





4.2. Fitting a single multivariate skew t-distribution 

To fit a specified FM-uMST model, the core function in EMMIX-uskew, fmmst, is used. This 
implements the algorithm described in Section 3. A typical function call of fmmst is: 

fmmst(g, dat, initial=NULL, known=NULL, itmax=100, eps=le-6, debug=TRUE) 

The main arguments used within this function are: 

• g: a scalar that specifies the number of uMST components to be fitted. 

• dat: an n x p matrix containing the data. 

• initial: a list that specifies the initial values used to start the algorithm. 

• known: a list that specifies any model parameters that are known and so not required 
to be estimated. 

• itmax: a scalar that specifies the maximum number of iterations to be used. 

• eps: a scalar that specifies the termination criterion of the EM algorithm loop. 
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Note that if the initial values of the model parameters are provided by the user, the argument 
initial is expected to be structured as described in Table 1. Similarly, known is expected 
to have the same structure. When initial=NULL, fmmst will generate a set of initial values 
using the procedure described in Section 3.2. Any parameters specified in known are taken 
as known parameters and hence are not estimated by fmmst. There is no need to specify the 
values of all the parameters in initial and known when only some of the parameters are 
known. Parameters that are not specified during the function call are estimated by fmmst. 
The termination criteria for the EMMIX-uskew algorithm is controlled by the parameters 
itmax and eps. The EM loop terminates when either one of the two criterion is satisfied, 
whichever occurs first: (a) the EM loop reaches itmax iterations (default is 100 iterations), or 
(b) the relative difference between the current log likelihood value and that from the previous 
iteration is smaller than eps (default is le-6). The last argument of fmmst is debug. When 
the option debug is set to TRUE (default), fmmst prints the log likelihood value at each iteration 
and displays a summary of the parameters of the fitted model after termination. To turn off 
debug mode, simply set debug=FALSE. For further details of the arguments of fmmst, the 
reader is referred to the documentation of fmmst. This can be accessed by typing ?fmmst at 
the R command prompt. 

We consider now the T-cell phosphorylation dataset (Maier et al. 2007) as an example of 
asymmetrically distributed data, available from Pyne et al. (2009b). The data contain mea- 
surements of blood samples stained with four antibodies, CD4, CD45RA, SLP76, and ZAP70. 
For illustration, we randomly select 500 observations and focus on two of the variables, CD4 
and ZAP70. To fit a MST model to this bivariate Lymphoma dataset, under the default 
settings, the following command is issued: 

R> set. seed (12345) 
R > data ("Lympho") 

R > LymphoSample <- Lympho [sample (1 :nrow (Lympho) , 500),] 
R > Fit <- fmmst (1, LymphoSample) 

A summary of the output of the fitted model can be obtained using the summary function. 
This prints the values of the fitted model parameters for each component. For a fitted uMST 
model, the weighting proportion (which is 1) is not printed. The following output shows a 
typical summary of a fitted single component uMST model. 

R > summary (Fit) 

Finite Mixture of Multivariate Skew t-Distribution 
with 1 component 



Mean: 

[1,] 4.808245 
[2,] 5.500559 



Scale matrix: 
[[1]] 



[,2] 
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[1,] 0.06778378 0.03721489 
[2,] 0.03721489 0.04811898 



Skewness parameter: 
C.l] 

[1,] -0.7082174 
[2,] -0.7990700 



Degrees of freedom: 
5.851434 

> summary (Fit) 

Finite Mixture of Multivarate Skew t-Distributions 
with 1 component 

Mean: 

[,1] 

[1,] 4.808245 
[2,] 5.500559 

Scale matrix: 
[[1]] 

[,1] [,2] 
[1,] 0.06778378 0.03721489 
[2,] 0.03721489 0.04811898 

Skewness parameter: 
C.l] 

[1,] -0.7082174 
[2,] -0.7990700 

Degrees of freedom: 
5.851434 

To view a more detailed output of the fmmst function, the print function is called. This out- 
puts a list containing 11 elements. The first five elements give the estimates of the parameters 
of the fitted FM-uMST model, as described in Table 1. 

The posterior probability of component membership is given by the output argument tau, 
a g x n matrix where the rows corresponds to the component number. The final partition 
of each data point, based on tau, is stored as clusters. The value of the log likelihood 
function, evaluated with the current parameter estimates, is given by loglik. The last two 
arguments aic and bic are the values of the Akaike information criterion (AIC) and the 
Bayes information criterion (BIC), respectively. The following output shows an excerpt from 
the second part of the print output of the fitted model. 
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R> print (Fit) 

Finite Mixture of Multivariate Skew t-Distributions 
with 1 component 

. . . first five components omitted . . . 
$tau 

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] 
[1,] 111111111 1 1 1 1 1 

. . . rest omitted . . . 

$clusters 

[1] 111111111 1 1 1 1 1 

. . . rest omitted . . . 

$ik 

[1] -880.7115 
$aic 

[1] 1777.423 
$bic 

[1] 1811.14 



4.3. Fitting mixtures of multivariate skew t-distributions 

This section presents an illustration of fitting a mixture of unrestricted skew t-distributions 
to some bivariate bimodal asymmetric data. We consider the Australian Institute of Sport 
(AIS) data from Cook and Weisberg (1994), where thirteen body measurements on 102 male 
and 100 female athletes were recorded. In this demonstration, we consider the clustering of 
the data with a two component skew i-mixture model based on the two variables Height and 
Body fat. By setting debug=TRUE, we can examine the value of the log likelihood function at 
each iteration. 

fi> Fit2 <- f must (2, ais[,c(2,12)] , debug=TRUE) 

Finite Mixture of Multivariate Skew t-Distributions 
with 2 components 



Iteration 1 : loglik = -1372.711 
Iteration 2 : loglik = -1370.495 
Iteration 3 : loglik = -1368.773 
Iteration 4 : loglik = -1367.392 
Iteration 5 : loglik = -1366.251 
. . . rest omitted . . . 
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Iteration 100 : loglik = -1343.541 




[,2] 



[1,] 181.91720 181.448906 
[2,] 13.67975 5.814277 

Component scale matrices: 
[[1]] 

C.l] [,2] 
[1,] 26.32964 18.75208 
[2,] 18.75208 16.12990 

[[2]] 



[1,] 61.598537 2.3177048 
[2,] 2.317705 0.1515253 

Component skewness parameters : 

[,1] [,2] 
[1,] -9.582015 3.591328 
[2,] 5.975328 5.729339 

Component degrees of freedom: 
60.03386 28.94895 

Component mixing proportions: 
0.4102178 0.5897822 

We compare the results with two other model-based clustering methods provided by the 
package mixsmsn (Prates et al. 2011) and EMMIX-skew (Wang 2009). As mentioned previ- 
ously, this two models are based on mixture of restricted versions of the multivariate skew 
i-distributions. The first model adopts the skew-normal/independent skew i-distribution 
(Cabral et al. 2012) as its component densities, which is equivalent to the restricted skew 
i-distribution (Pyne et al. 2009a) used in the second model. However, it should be noted 
that, in the ECME algorithm implemented in the package mixsmsn, the component degrees 
of freedom are constrained to be the same. A comparison of the table of cluster labels to 
the true class labels (given by ais$Sex in this example) reveals that the FM-uMST model 
has a higher number of correct allocations (183 compared to 161 and 157 given by mixsmsn 
and EMMIX-skew respectively). Thus, the unrestricted FM-uMST model in EMMIX-uskew 
gives a more accurate clustering in this case. 

fl> library ("mixsmsn") 

R> Fit3 <- smsn.mmix(ais[c(2,12)] , g=2, family="Skew.t", group=TRUE) 

R> Fit4 <- EmSkew(ais [c(2 , 12)] , 2, "mst", debug=FALSE) 
R> table (ais$Sex, Fit3$group) 



1,11 



[,2] 



1 2 
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91 11 

1 29 71 

R> table (ais$Sex, Fit4$clust) 
1 2 

89 13 

1 32 68 

R> table (ais$Sex, Fit2$clusters) 
1 2 

5 97 

1 86 14 



4.4. Discriminant analysis 

Discriminant analysis based on a specified FM-uMST model can be performed using the 
fmmstDA function. 

fmmstDA(g, dat, model) 

The data in dat is assigned to the cluster corresponding to the component of the FM-uMST 
model with the highest posterior probability. Specifications of the model parameters must 
be provided in model, which is typically an output from fmmst. The following commands 
shows an example using fmmstDA. A random sample of FM-uMST variables is generated from 
rfmmst, the first part of which is used as training set, and the second is a testing set. The 
FM-uMST model fitted to the training set is then used for classifying the data in the testing 
set. 

R> set .seed (732) 

R> X <- rfmmst (3, 200, known=obj) 
R> Ind <- sample (l:nrow(X) ,175) 
R> train <- X[Ind,] 
R> test <- X[-Ind,] 

R> trainmodel <- fmmst(3, train[,l:2]) 
R> fmmstDA(3, test[, 1:2], trainmodel) 
R> results <- fmmstDA (3, test [, 1 : 2] , trainmodel) 
R> table (test [, 3] , results) 
results 
12 3 
10 6 

2 5 

3 13 1 



4.5. Visualization of fitted contours 

The EMMIX-uskew package supports visualization of the contours of a FM-uMST model in 
2D and 3D. The plots are generated by the functions fmmst . contour . 2d and fmmst . contour . 3d, 
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f mmst . contour. 2d (dat, model, grid=50, drawpoints=TRUE, clusters=NULL, 

levels=10, component=NULL, map=c ("scatter", "heat", "cluster"), ...) 

f mmst. contour .3d (dat, model, grid=20, drawpoints=TRUE, clusters=NULL, 
levels=0.9, component=NULL , ...) 

In f mmst . contour . 2d (f mmst . contour . 3d), the first argument dat is a matrix of coordinates 
with two (three) columns. The second argument model is a list of at least six elements: the 
five model parameters, and a vector of cluster lables clusters. The grid size is determined 
by grid. By default, the data points are included in the plot. If only the contour is re- 
quired, the option drawpoints=FALSE should be set. When including the points in a plot, 
clusters specifies the component labels of each point according to which the data points 
will be coloured. The argument levels is either an integer specifying the number of contour 
lines to be plotted, or a vector of quantile values. For f mmst . contour . 3d, only the 90th 
percentile contour is plotted by default. If more contours is required, the argument levels 
should be a vector of the required quantiles. For example, if a plot of the 25 th , 50 th , and 75 th 
quantiles are required, then levels = c(0.25, 0.5, 0.75). Bivariate data have the option 
of being plotted as an intensity map instead of scatter plot. This can be obtained by setting 
map="heat". There is also an option for plotting a cluster map of a fitted model using the op- 
tion map=" cluster". Plots for specific components of a mixture model can be requested with 
the argument component. When component=NULL (which is default), the mixture contour is 
plotted. When component is a vector with length between 1 and g, the specified components 
are plotted and the mixing proportion is not taken into account. The last argument of the 
f mmst . contour functions ". . ." allows the user to pass additional arguments to the plot 
function, such as the colour and size of the points. 

Figure la shows the contour of the fitted MST model to the Lymphoma data. Here, a heatmap 
of the original data is used. This plot can be generated via the command, 

fi> f mmst . contour .2d (Lympho , model=Fit , map="heat" , 
xlab="SLP76", ylab="ZAP70") 

The default f mmst . contour . 2d function will return a scatter plot of the data in 2D super- 
imposed with the contours of the fitted mixture model. For example, the following command 
generates a contour plot of the fitted FM-uMST model to the ais data in Section 4.2 (Fig- 
ure lb). Note that fmmst . contour . 2d coloured the sample points according to the clustering 
given by the argument clusters: 

R> label <- ais$Sex 

R> label [label==0] <- 2 

R> fmmst . contour . 2d (ais [,c (2, 12)] , model=Fit2, clusters=label , 
xlab="Ht", ylab="Bf at") 

Suppose we are interested in visualizing a clustering map of the fitted model to the simulated 
data in Section 4.4. This plot can be generated by issuing the following command. 

R> fmmst . contour .2d (X, model=trainmodel , clusters=X[, 3] , map=" cluster" , 
component=l :3) 
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Figure 1: 2D contour plots generated by the fmmst . contour . 2d function, (a) The fitted 
contour of the single component uMST model plotted over the hue intensity diagram of 
the Lymphoma dataset; (b) the default mixture contour plot of the fitted two-component 
FM-uMST model of the AIS dataset; (a) the contour of the individual components of the 
three-component model fitted to a bivariate synthetic sample plotted over the cluster map of 
the sample. 

The output is given in Figure lc. Note that the grid size of the cluster map is also controlled 
by the argument grid whose default value is 50. 

To demonstrate the use of fmmst . contour . 3d, we consider the clustering of a trivariate Diffuse 
Large B-cell Lymphoma (DLBCL) dataset provided by the British Columbia Cancer Agency. 
The data contain over 8000 cells derived from the lymph nodes of patients diagnosed with 
DLBCL. The samples were stained with three markers CD3, CD5, and CD19. The task is 
to cluster the cells into four groups. Hence we fit a four-component FM-uMST model to the 
data. The maximum number of iterations was increased to 300. 

A scatterplot of the data is shown in Figure 2, where the dots are coloured according to the 
clustering provided by human experts, which is taken as the 'true' class labels. Figure 2b 
shows the 95 th percentile density contours of the four components of the fitted model which 
are displayed with matching colours. The 3D plot uses the rgl visualization device system, 
and hence supports user friendly interactive navigation. The plots can be rotated in real-time 
to select a suitable viewpoint. The following code can be used to generate the 3D plots in 
Figure 2. 

R> Fit4 <- fmmst (DLBCL , 4, itmax=300) 

R> fmmst. contour. 3d (DLBCL, model=Fit4, level=0.9, drawpoints=FALSE, 
xlab="FLl.LOG", ylab="FL2 .LOG" , zlab="FL4.L0G" , quant ile=0 . 95) 

Results from manual expert gating enable us to calculate an error rate of clustering, which 
is measured by choosing among the possible permutations of cluster labels the one that gives 
the best performance. Note that dead cells were removed before evaluating the error rate 
against the benchmark results. For comparison, we calculated the error rate associated with 
the clustering results given by three other methods - FLAME (Pyne et al. 2009a), flowClust 
(Lo et al. 2008) and flowMeans (Aghaeepour et al. 2011). From Table 2, the FM-uMST model 
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Method 


FLAME 


flowClust 


flowMeans 


FM-uMST 


Error rate 


0.074 


0.076 


0.140 


0.046 



Table 2: Error rate of misclassification of four methods for the DLBCL dataset. 



FL2.LOG t 




(a) 8oo (b) 



Figure 2: 3D contours plot of the DLBCL dataset generated by the fmmst . contour .3d func- 
tion, (a) A scatterplot of the data coloured according to the the true clustering labels of the 
DLBCL dataset; (b) fitted contour of the three component FM-uMST model for the DLBCL 
dataset. 

clearly shows superior performances in this dataset. 



5. Concluding remarks 

We have presented the R package EMMIX-uskew for fitting finite mixtures of unrestricted 
multivariate skew t-distributions to heterogeneous asymmetric data. The package implements 
a closed-form EM algorithm for fitting FM-uMST models and provides user-friendly visual- 
ization of the fitted contours in 2D and 3D. The major features of the software have been 
demonstrated on three real examples on the T-cell phosphorylation data, the Australian In- 
stitute of Sports (AIS) data and the DLBCL dataset. The clustering results were compared 
to mixture models with restricted multivariate skew t components and other methods. In 
both the AIS and DLBCL illustrations, the unrestricted model gave better clustering results 
with respect to the true class labels. 
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